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Abstract 

Background: Semantic similarity searches in ontologies are an important component of many bioinformatic 
algorithms, e.g., finding functionally related proteins with the Gene Ontology or phenotypically similar diseases 
with the Human Phenotype Ontology (HPO). We have recently shown that the performance of semantic similarity 
searches can be improved by ranking results according to the probability of obtaining a given score at random 
rather than by the scores themselves. However, to date, there are no algorithms for computing the exact 
distribution of semantic similarity scores, which is necessary for computing the exact P-value of a given score. 

Results: In this paper we consider the exact computation of score distributions for similarity searches in ontologies, 
and introduce a simple null hypothesis which can be used to compute a P-value for the statistical significance of 
similarity scores. We concentrate on measures based on Resnik's definition of ontological similarity. A new 
algorithm is proposed that collapses subgraphs of the ontology graph and thereby allows fast score distribution 
computation. The new algorithm is several orders of magnitude faster than the naive approach, as we demonstrate 
by computing score distributions for similarity searches in the HPO. It is shown that exact P-value calculation 
improves clinical diagnosis using the HPO compared to approaches based on sampling. 

Conclusions: The new algorithm enables for the first time exact P-value calculation via exact score distribution 
computation for ontology similarity searches. The approach is applicable to any ontology for which the annotation- 
propagation rule holds and can improve any bioinformatic method that makes only use of the raw similarity 
scores. The algorithm was implemented in Java, supports any ontology in OBO format, and is available for non- 
commercial and academic usage under: https://compbio.charite.de/svn/hpo/trunk/src/tools/significance/ 



Background 

Ontologies are knowledge representations using controlled 
vocabularies that are designed to help knowledge sharing 
and computer reasoning [1]. Many ontologies can be 
represented by directed acyclic graphs (DAGs), whereby 
the nodes of the DAG, which are also called terms of the 
ontology, are assigned to items in the domain and the 
edges between the nodes represent semantic relations. 
Ontologies are designed such that terms closer to the root 
are more general than their descendant terms. For the 
ontologies we consider in this paper, the annotation- 
propagation rule applies, that is, items are annotated to 
the most specific term possible but are assumed to be 
implicitiy annotated to all ancestors of that term. 
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Examples for ontologies are the Foundational Model of 
Anatomy (FMA) ontology [2], the Sequence Ontology [3], 
the Cell Ontology [4], and the Chemical Entities of Biologi- 
cal Interest (ChEBI) ontology [5], which describe objects 
from the domains of anatomy, biological sequences, cells, 
and biologically relevant chemicals. In contrast, other 
ontologies are used to describe the attributes of the items 
of a domain. For instance, GO terms are used to annotate 
genes or proteins by describing the biological functions or 
characteristics to the proteins. The Mammalian Phenotype 
Ontology (MPO) [6] and the Human Phenotype Ontology 
(HPO) [7] describe the attributes of mammalian and 
human diseases. In this case, the domain object is a disease 
such as Marfan syndrome, whose attributes are the clinical 
features of the disease such as arachnodactyly and aortic 
dilatation. In other words, terms of phenotype ontologies 
such as the MPO and HPO can be conceived of as 
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describing abnormal qualities (e.g. hypoplastic) of anatomi- 
cal or biochemical entities [8]. 

Semantic similarity between any two terms within an 
ontology is based on the annotations to items in the 
domain and on the structure of the DAG. Different 
semantic similarity measures have been proposed [9,10] 
and the measures have been used in many different appli- 
cations in computational biology. For example, different 
studies show that semantic similarity between proteins 
annotated with GO terms correlate with sequence simi- 
larity [11-13]. Other studies investigated the correlation 
of gene coexpression with semantic similarity using GO 
terms [14,15]. In addition, semantic similarity measures 
for GO terms have been used to predict protein subnuc- 
lear localization [16]. 

In another application we have implemented a semantic 
similarity search algorithm in the setting of medical diag- 
nosis. A user enters HPO terms describing the clinical 
abnormalities observed in a patient and a ranked list of 
the best matching differential diagnoses is returned [17]. 
This kind of search can be performed using raw semantic 
similarity scores calculated using any of the semantic simi- 
larity measures [18-21,12,22]. However, among these 
different measures the node-based pairwise similarity 
measure defined by Resnik turned out to have the best 
performance in our previous study [17] and is therefore 
considered in this work. 

The search is based on q attributes (HPO terms) that 
describe the phenotypic abnormalities seen in a patient for 
whom a diagnosis is being sought. For each of the entries 
of a database containing diseases annotated with HPO 
terms corresponding to their characteristic signs and 
symptoms, the best match between each of the q terms of 
the query with one of the terms annotating the disease is 
found and the average of the semantic similarity scores is 
determined. The diseases are then ranked according to 
these scores and returned to the user as suggestions for 
the differential diagnosis. 

The distribution of scores that a domain object can 
achieve varies according to the number and specificity of 
the ontology terms used to annotate it. In a recent study 
by Wang et al. [23], it was discovered that many of the 
commonly used semantic similarity measures, including 
the ones used in this work, are biased towards domain 
objects that have more annotations. The effect was 
termed annotation bias. Applications that use the scores 
alone therefore tend to preferentially select items with 
higher numbers of annotations, which may lead to wrong 
conclusions [23]. 

Previously, we developed a statistical model to assign 
P-values to the resulting similarity scores on the basis of 
the probability of a random query obtaining at least as 
high a score in order to compensate for the fact that 



different domain objects may have a different number of 
annotations. Using extensive simulations, we showed that 
this approach outperformed searches based on the 
semantic similarity scores alone [17]. A disadvantage of 
that procedure was the fact that extensive simulations 
using randomized queries were necessary in order to esti- 
mate the true distribution of the semantic similarity 
scores, which is needed in order to calculate a P-value for 
any given similarity score. 

In this paper, we describe an algorithm to collapse a 
DAG representing an ontology into connected compo- 
nents of nodes corresponding to terms that make identical 
contributions to the semantic similarity score. The new 
algorithm reduces the amount of computational time 
needed to calculate the score distribution (and thereby 
P-values) by many orders of magnitude compared to a 
naive calculation. A preliminary description of the algo- 
rithm was presented in a conference paper [24] . Here, we 
validate the algorithm by comparing to sampling based 
approaches and show using simulations that the applica- 
tion of the exact P-value outperforms sampling based 
approaches in the context of clinical diagnostics with the 
HPO. 

Methods 

Notation 

We consider an ontology O composed of a set of terms 
that are linked via an is-a or part-of relationship. The 
ontology O can then be represented by a DAG Q = (V, E), 
where every term is a node in V and every link is a direc- 
ted edge in E. A directed edge going from node «i to n 2 
is denoted e 12 and we refer to n 2 as the parent of n x . An 
item i is defined as an abstract entity to which terms of 
the ontology are annotated. Let Anc(n) be defined as the 
ancestors of n, i.e., the nodes that are found on all paths 
from node n to the root of Q, including n. We note that 
the annotation-propagation rule states that if an item is 
explicitly annotated to a term n, it is implicitly annotated 
to Anc(n). In order to describe the implicit annotations 
we define fiMPL, l 6 1; 7~ be the set of terms that has been 
explicitly annotated to item i, then J~ mpL - U ne -j-Anc(n), 
namely all terms that are annotated to item i and all their 
ancestors in Q. Let the set of common ancestors of two 
nodes n x and n 2 be defined as ComAncin^ n 2 ) = Anc(n^) 
n Anc{n 2 ). Let Desc(n) be the set of descendant nodes of 
n, again including n. Note that in this notation descen- 
dant nodes are considered only once, even if there are 
multiple paths leading to them. 

Multisets 

In what follows we need to compute the similarity also 
between a multiset and a set of terms. The concept of 
multisets [25] is a generalization of the concept of sets. 
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In contrast to sets, in which elements can only have a 
single membership, the elements of multisets may 
appear more than once. 

Formally, a multiset M is a set of pairs, M = {(si, 
Wj),..., (s d , m d )}, in which s t e U = {si,..., s d } are the ele- 
ments of the underlying set U. Furthermore, m, defines 
the multiplicity of s, in the multiset. The sum of the 
multiplicities of M is called the multiset cardinality of 
M, denoted \M\. Only multiplicities in the domain of 
positive integers are considered, i.e., m t e N + . We define 
a multi subset relation between multiset N and multiset 
M, denoted as N Q M, as a generalization of the subset 
relation between two sets: 

N c M V(s, n) e N : 3m > n : (s, m) e M. 



The multiset coefficient M(n,q) = 



n + q — 1 



denotes 



the number of distinct multisets of cardinality ^, with 
elements taken from a finite set of cardinality n. It 
describes how many ways there are to choose q ele- 
ments from a set of n elements if repetitions are 
allowed. 

Similarity measures 

We will concentrate in this work on the class of similar- 
ity measures that are based on the information content 
(IC) of a node: 



IC(n) = — log p{n), 



(1) 



where p(n) denotes the frequency among all items in 
the domain of annotations to n, which implicitly con- 
tains all annotations of descendants of n due to the 
annotation-propagation rule. The information content is 
a nondecreasing function on the nodes of Q as we des- 
cend in the hierarchy and is therefore monotonic. The 
similarity between two nodes was defined by Resnik as 
the maximum information content among all common 
ancestors [19]: 



sim(ni,n 2 ) = max{IC(a)\a e ComAnc(n\,n 2 )}. 



(2) 



Equation (2) provides a definition for the similarity 
between two terms. Other popular pairwise measures 
that additionally incorporate the IC of the query terms, 
for example [20,21], are not considered here (see 
Discussion). 

One can extend this concept to define a similarity 
between two domain objects that are each annotated by 
multiple ontology terms by taking the average of the 
best pairwise similarities for all terms [11]: 



sim avg (T\,T 2 ) = 7=- max sim[tii, n 2 ). 



(3) 



Note that Eq. (3) is not symmetric [12], i.e., it is not 
necessarily true that sim" vg {Ti , T 2 ) = sim avg {T 2 ,Ti). We 
point out that in other works average often refers to a 
symmetric definition. Using the nomenclature of Pes- 
quita et al. [9], Eq. (3) may be referred to as asymmetric 
best-match average, here average for short. 

Instead of taking the average the maximum similarity 
between a term annotating one of the domain objects 
and a term annotating the other domain object can be 
used to define the following symmetric measure: 



sim max [Ti,T 2 ) 



max sim(n\,n 2 ). 

n 1 eTi,n 2 eT2 



(4) 



Equation (4) can be considered a simplified case of Eq. 
(3) because instead of averaging over all best-pairwise 
terms for each m e 71 compared to n 2 € T 2 only the 
highest similarity of all possible pairs is retained. There- 
fore, we will show the algorithm applied to Eq. (3) and 
sketch the changes for Eq. (4) later. One can use equation 
(3) or (4) to define a similarity between a set of query 
terms Q, i.e., 7i = Q and an object in a database. Then, Q 
can represent any set of terms from the ontology O 
whereas T 2 refers to database objects (such as diseases 
annotated to HPO terms). As we are using this setup for 
the similarity queries we will omit the index and refer to 
7~ 2 as the target set T. See Figure 1 for an example com- 
putation of sim avg . 

Because we later make use of scores derived at the 
maximization step in Eq. (3) we define: 



sim(ni,T) = maxsim(ni, n 2 ), 

n 2 eT 



(5) 




Figure 1 Example Computation of sim"" 9 . Computation of sim av9 
on a DAG with six nodes. The target set is 7" = {B, C} (black 
nodes) and the query set Q = {D, F] (nodes with horizontal lines). 
The IC value of a node is shown in a small, dashed, attached oval. 
The most similar terms for D and F are 8 and C respectively, 
because /C(fi) > IC(A) and IQQ > IC(B). Therefore, 



sim[D, B) + sim{F, C) IC{B) + JC(C) 2 + 2.5 



2 2 2 

that only terms involving nodes in T'lMPL _ £j were 

considered in the calculation. 



Note 
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to be the target set similarity score of against a 
target set T, To avoid confusion we will denote scores 
of the score distribution of sim avi by S and target set 
similarity scores sim(n, T) by s. 

Definition of statistical significance for semantic similarity 
scores 

In this paper we will present methods for analytically cal- 
culating the probability distribution of similarity scores 
for comparisons between a query set Q with q terms 
against an item that has been annotated with a target set 
T of nodes. For example, if a clinician chooses a set Q of 
HPO terms describing abnormalities seen in a patient 
and uses Eq. (3) to calculate an observed score S a b s to a 
disease that has been annotated with terms of the HPO, 
we would like to know the probability of a randomly cho- 
sen set of q nodes achieving a score of S a b s or greater. In 
this case, each disease in the database represents a target 
set (for instance, there are currently over 5000 diseases in 
the clinical database used by the Phenomizer at the HPO 
Web site). 

In other words, our methods will be used to calculate a 
.P-value for the null hypothesis that a similarity score of 
S 0 b s or greater for a set of q query terms Q and a target 
set 7" has been observed by chance. We take all queries 
to be equally likely and define the P-value to be the pro- 
portion of queries having a score of at least S 0 b s - 

pSira , c ^ 5 , \(Q\sim{Q,T) > s obs ,Q = [ ni ,..., nil } £V}\ 
P * rlS - *' = L^j ■ (6) 

In this definition all nodes of V can be part of a query, 
even if one node is an ancestor of the other. Note that 
the number of distinct scores for the complete score 
distribution of P%j- is dependent on cj, T, and the simi- 
larity measure. 

Simulation of patients for clinical diagnosis 

Similar to our previous work [17], we use simulations to 
compare different approaches. Using 1701 OMIM dis- 
eases currently annotated with 2-5 HPO terms in the 
Phenotypic abnormality subontology, we generated arti- 
ficial queries by (i) taking all terms annotated to the dis- 
ease with no noise or imprecision as the query (NONE), 
(ii) randomly exchanging one term if q = 3 or q = 4 and 
two terms if q = 5 (NOISE), (Hi) with probability 0.5 
exchange a term with one of its parent terms if possible, 
(IMPRECISION), or (iv) using first IMPRECISION then 
NOISE. 

For each of the 1701 OMIM diseases we generate the 
query as described above and rank all diseases using one 
of the measures (Score, P-value sampled 10 3 , 10 4 , or 10 s 
times, and P-value exact). We then calculate the rank of 
the disease from which the query was generated. In case 



of ties we take the average rank (e.g. if four diseases 
rank first with the same value, all four get rank 2.5). 
Note that for the rankings using P-values (sampled or 
exact) we ranked first by P-values and then by score. 

Results 

A naive algorithm: exhaustive computation of score 
distributions 

We represent the score distribution as 
SV = {(Si, [Sk,F k )}. Every pair (S;,F;) 6 SV con- 
tains a unique score 5, and a count F t that defines its 
frequency within the distribution. 

A naive approach to calculating the complete score 
distribution is to determine the similarity of each possi- 
ble term combination Q £ V of size q with the fixed tar- 
get set T . The complete procedure is outlined in 
Algorithm 1. It requires two basic operations that are 
applied to the set SV- The first operation called getScor- 
ePair returns the pair that represents the given score or 
nil in case the entry does not exist. The second opera- 
tion denoted putScorePair puts the given pair into the 
set SV, overwriting any previously added pair with the 
same score. For further analyses we assume that both 
operations have constant running time. 
Input: V, q, X 

Output: Score distribution 
SV = {(S 1 ,F 1 ),...,{S k ,F k )} 

ISV = 9 

2 foreach Q = n 2 ,..., n q } £ V do 

3 S new <- sim av $(Q,T) 

4 (S, F) <- getScorePair(ST>, S new ) 

5 if (S, F) nil then 

6 putScorePair{SV , (S new , F + 1)) 

7 else 

8 putScorePair{SV , (S new , 1)) 

9 return SV 

Algorithm 1: Naive score distribution computation 

for sim avg 



As the number of possible term combinations is 




and each similarity computation (line 3) costs 0{q ■ |T|) 
operations for Eq. (3) Algorithm 1 runs in C(|V|'' • q ■ \T\) 
time. A typical size of | V\ = 10000 as for the HPO demon- 
strates that the naive approach is impractical for values q 
> 2. The naive approach neglects the relationships of the 
nodes in Q and T, We will exploit these relationships in 
the next section and group nodes in Q according to their 
contribution to the score distribution computation. 

A faster algorithm: exploiting redundant computations 

Recall that all terms from the target set T are contained 
in 7~™pl w e w [\[ p r0 ve now that only the IC values of 
nodes in fiMPL are relevant for the score distribution 
computation. 
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Lemma 1. Given a DAG Q = (V,E)and a target set 
T = {n\, . . . , Hfe} c V, all scores in the score distribution 
of the similarity measure of Eq. (?>) are derived from IC 
values of the nodes in 7~™pl 

Proof. Computing the complete score distribution 
involves repeatedly evaluating sim av Z{Q,T) in Alg. 1 
using equation (3). The first step for the computation of 
Eq. (3) is to maximize sim(n\,n 2 ) for each node ri\ e Q 
compared to nodes n 2 6 T ■ The maximum IC value for 
sim{ni, n 2 ) must be taken from a node in j~mpl i because 
by definition Anc{n 2 ) C f IMPL - 

Lemma 1 implies that the computations in the naive 
algorithm, which enumerates all nodes in V, are highly 
redundant as the size of j^mpi [ s an upper bound on 
the number of different target set similarities encoun- 
tered during score distribution computation. Figure 2 
shows the contribution of all possible queries of size q = 
2 for an example ontology. For instance, whenever node 
C or D are part of a query the target set similarity score 
obtained from Eq. (5) is IC(C) = 4, highlighted in red in 
Figure 2, and used for computing sim a " s (Q,T). 

Therefore, instead of enumerating over the nodes in 
V, we will first group nodes that have the same target 
set similarity score s in the maximization step in Eq. (3). 
Denote all nodes n e V that have the same target set 
similarity score s for a given target set f as N s : 



N s = {n\n e V,sim(n,T) = s] 



(7) 



Example 1. It can be seen in Figure 2 that N 0 = {A}, 
N 2 = {B}, and N 4 = {C, D} for Q with T = {A, B, Q. 

Observe that two nodes n i( rij e f IMPL , n, / rip belong 
to the same set N„ if /C(«,-) = IC(nj). This observation 



will be essential when we devise an algorithm for com- 
puting N s . 

The intuition behind the fast computation is that 
instead of selecting combinations of all nodes of V and 
constructing the score distribution one by one, we focus 
on the combinations of different target set similarity 
scores s and use their frequency \N S \ to avoid redundant 
enumeration. For any 7" the set U of distinct target set 
similarity scores is defined as: 



U = {lC{n)\n € T" 



(8) 



Instead of considering sets of nodes in V we will now 
consider multisets M q of target set similarity scores in 
U, where \M q \ = q. In order to do that we define as M. 
the multiset induced by all target similarity scores s and 
their corresponding multiplicities m, that is, 



M = {(si, mi), {s d , m d )\Si €U,mi = \N si \ 



(9) 



Then M^ ;; represents the set of all multi subsets of J\A 



that have multiset cardinality q, i.e., 
M q all = {M q \M q c M, IM'I = q}. 



(10) 



The value of sim avs computed for a particular M q is 
the same for all query sets of nodes that correspond to 
M q (see Figure 2, Example 2). Therefore, if we can cal- 
culate the number of such sets as well as the score cor- 
responding to each multiset M q of target set similarity 
scores in U, we can determine the distribution of simi- 
larity scores sim avg for all possible queries of any given 
size q. 




sim av 9({A,B},T)= 
sim av 9({A,C},T)= 
sim av 9({A,D},T)= 
sim av 9({B,C},T)= 
sim av 9({B,D},T)= 
sim av 9({C,D},7)= 



sim(A,A) +sim(B,B ) )*Q.5=(0 + 2] 

sim(A,A)+sim(C,C) )*0.5=(0 + 4] 

sim(A,A)+sim(D,C) )*0.5=(0 + 4] 

sim(B,B)+sim(cH) )*0.5=(2 + 4] 

sim(B,B)+sim(D,C) )*0.5=(2 + 4] 

*0.5=(4 + 4; 



*0.5=1 
*0.5=2 
*0.5=2 
*0.5=3 
*0.5=3 
*0.5=4 



Figure 2 Redundancy in Naive Score Distribution Computation with sim"" 9 for Queries of Size Two. Computation of the score distribution 
for sim"" 9 on a DAG Q with four nodes for all possible queries of size two. The target set 7" = {A, B, C} is shown as black nodes. Note that 
7" = 7"' M fi- here. The IC value for nodes is shown in a small dashed oval. All computations of Eq. (5) that result in the same target similarity 
score are colored in blue, green, and red for the target set similarity scores 0, 2, and 4, respectively. 
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Denote the similarity for a multiset M q as: 

n av &< 



sim 



t{M.1) = - m-s. 

^ {s,m)eM'l 



(ID 



The number of ways of drawing m nodes from a com- 
ponent of size |JV S | can be calculated using the binomial 
coefficient. The total number of combinations is then 
the product of all binomial coefficients, denoted as the 
multiset frequency for a multiset M q : 



freq{M«) = n 

(s,m)eM c i 



IN, 
m 



(12) 



Example 2. In total there are 2 query sets with 
sim av s{Q,T) = 2 for the DAG in Figure 2, namely {A, Q, 
{A, D}. After preprocessing, we obtain N 0 = {A}, N 2 = {B}, 
and N 4 = {C, D} (Example 1). Alg. 2 enumerates all valid 
multisets of cardinality 2 for the sets N s considering their 
size | N s | . The only way of attaining an average score of 2 
is to select one node out of N 0 and N 4 , represented by the 
multiset M 2 = {(0,1), (4,1)} for which sim avg (M 2 ) = 2. The 
multiset frequency of M 2 gives the same result as shown 

in Figure 2, freq{M 2 ) = ( lN ° l W ^ ) = 1 • 2 = 2. 

Instead of iterating over two sets we consider one multiset. 

Theorem 1. Let SV = {(Si, Fx), . . . , {S k , F k )}be the 
score distribution computed with sim avg for an ontology 
DAG Q = {V,E), target set f C Vand query size q. The 
frequency F with which any given score S occurs amongst 
all possible queries of size q is then: 



E 



freq(Ml). 



MleMl,,, sim a,, x(M1)=S 



(13) 



A proof of Theorem 1 is provided in Appendix A and 
a faster algorithm based on Theorem 1 is shown in Alg. 
2. We enumerate all distinct multisets of M^ ;; and add 
their frequency to the score distribution SV: instead of 
iterating over all sets of size q in Alg. 1, thereby redu- 
cing the number of operations. In order to apply the 
algorithm to score distribution computation for sim max , 
line 3 of Alg. 2 needs to be replaced. Instead of comput- 
ing the average of all scores in the multiset, the maxi- 
mum among them is assigned to S new . 

Preprocessing of the DAG for faster computation 

So far we have neglected how we can compute the 
values |Ns|,s e U but we will introduce an efficient algo- 
rithm in this section. We denote the algorithm as pre- 
processing because computation of |iV s | is independent 
of q. The preprocessing will divide the original graph 
into a set of connected components from which the |iV s | 
values can be deduced. 



Input: Ml n 

Output: Score distribution 
SV = {(S 1 ,F 1 ),...,{S k ,F k )} 

1 SV = 0 

2 foreach multiset M q e M^do 

3 S new <- sim avg (M q ) 

4 (S, F) <- getScorePair(ST>, S new ) 

5 if (S,F) * nil then 

6 putScorePair{SV , (S new , F + freq{M q y)) 

7 else 

8 putScorePair{SV, {S new ,freq{Ml))) 

9 return SV 

Algorithm 2: Faster score distribution computation 

for sim avg 

First, we invert the direction of all edges in E such 
that the edges are directed from the root towards the 
leaves of the DAG, and introduce edge weights w, , to 
the edges of Q. Let 



\lC[rn\ if n,eT mpL 

1,1 \ max{u>h,i\eh,i e E] otherwise 



(14) 



The edge weights are defined in a recursive manner. 
First, all weights of edges emerging from nodes in fiMPL 
are set. Then the maximum edge weight of all incoming 
edges for each node not in 7"' MP[ are propagated to all 
outgoing edges of the node, and as such propagated 
throughout the graph. Computing the edge weights is effi- 
ciently done after the nodes of Q have been sorted in topo- 
logical order, see Alg. 3. We now iterate across all nodes n t 
e V. For each node m € V, ni £ J~ IMPL I there is at least one 
path that leads to the node n > = argmaxsim(n„ n k ) If 

node has multiple parents, then by construction of the 
edge weights, an edge with a maximum weight will be a 
member of a path to nj. We therefore remove all other 
incoming edges. If there are multiple incoming edges with 
an identical, maximum edge weight, one of them can be 
chosen arbitrarily and the others are removed (Alg. 3, lines 
7-9). We now iterate over all remaining edges e ir j and 
remove all edges for which Hi, tij e f IMPL holds (Alg. 3, 
lines 10-12). Note that exactly \'y IMPL \ many connected 
components C, one for each n, € 7" IMPL remain. 

For all pairs of connected components such that /C(« ; ) 
= IC(n/) for m, tij G T mFL , Hi ^ Up the connected compo- 
nents Q and C, are merged to arrive at the desired sets 
N s ,s€U (Alg. 3, lines 13-16). 

All these steps are summarized in Alg. 3 and Figure 3. 

Theorem 2. Given a DAG Q = (V,E)and a target set 
T = {rii, . . . , n k ) C Vthe score distribution of Eq. (3) is 
computed by Alg. 2 and Alg. 3 in 
0{\E\ + |V| + M[\T IMPL \, q))time and space. 

Proof. The preprocessing of the DAG in Alg. 3 
involves inverting edges, topological ordering of V, 
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identify nodes in 
T IMPL , 
invert edge direction, 
set edge weights 2 
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nodes in 
-IMPL 




compute 
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components 



select edge with 
max incoming 
5 weight 2 





add size of the components to 
frequency of target set similarity 



s 


0 


2 


2.5 


frequency |Ns| 


1 


2 


3 



Figure 3 Overview of the Algorithm for Preprocessing. The general steps of Alg. 3 are shown on the DAG and 7~ of Figure 1. Nodes in 
q-lMPL are colored in black. The IC value of a node is depicted in a dashed oval. 



Input: V, T IMPL 

Output: node sets with identical target similarity 
score, i.e., N s 

1 for e V in topological order do 

2 for j in ey e E 
do /• 
Set weights */ 

3 if m 6 T IMPL then 

4 w i4 <- IC( ni ) 

5 else 

6 Wq <- max{w hJ |e fejj e E} 

7 for e V \ root do 

8 choose e/ w - e E s.t. \w hii > Wf,;i for all edges e/^e E 

9 remove all incoming edges of «, except e/,_, 

10 for e, v e E do /* 
Connected componentsQ */ 

11 if n„ rij € T JMPL then 

12 remove e 2 y from £ 

13 for 

s e {IC{n t )\n t <E T' MPL }do /"Mer- 
ging */ 

14 N s = 0 

15 foreach n, e T J,v,PL do 

16 N S <-N S U Q 

17 return AT S 

Algorithm 3: Graph preprocessing for faster 
computation 

introducing edge weights to E, removing edges in E, 
and computing the connected components of Q. This 
can be done with depth-first search (DFS) traversals of 



0(\E\ + |V|) with to a worst-case performance of 
0(\E\ + |V|) time and space. 

Algorithm 2 runs in 0{M{\T lMFL \, cj)) time and space. 
The outer foreach loop runs over all distinct multisets 
with cardinality q. The multiset coefficient M(\T IMPL \, cj) 
provides an upper bound for the number of these multi- 
sets. In each iteration the computation of the similarity 
score (line 3) and the multiset frequency, freq(M q ), have 
constant cost assuming a preprocessed lookup table for 
binomial coefficients and if common partial sim avg 
values are stored between the iterations, avoiding 
recomputation for similar multisets. In total, Alg. 2 and 
Alg. 3 run in 0[\E\ + \V\ + M(\T IMPL \, cj)) time and 
space. 

The theorem concludes the improvement to the naive 
algorithm, for example on average |7^ IMPI | ~ 38 for all 
diseases currently annotated with terms of the HPO, 
which currently has approximately 10000 terms and 
13000 relations. For instance, for a query with 5 terms, 
the naive algorithm would thus run in time proportional 
to 10000 5 • 5 • 38 = 1.9 x 10 22 , and the new algorithm 
in time proportional to 9000 + 11000 + 5 • M(38, 5) = 
4.3 x 10 6 . 

Experiments 

We now show the results of the new algorithm applied to 
the HPO [7]. In our previous work we implemented the 
Phenomizer as a system for experts in the differential 
diagnosis in medical genetics; the Phenomizer can be 
queried with a set of HPO terms to get a ranked list of 
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candidate diseases most similar to the query based on P- 
values derived from Resnik similarity scores, Eq. (3) [17]. 
However, for the Phenomizer we used Monte Carlo sam- 
pling to approximate the score distribution and we will 
investigate now the difference in using the exact P-value 
compared to sampling. 

As we are interested in ranking diseases for differential 
diagnosis we will take a similar simulation approach as 



in [17] and generate sets of artificial patients for which 
we know the OMIM disease, see Methods. In Figure 4 
the results are shown for the investigated scenarios 
NONE, NOISE, IMPRECISION, and NOISE + IMPRE- 
CISION. We compared the ranking of patients with the 
similarity score alone, sampling based P-values (10 - 
10 repetitions, the latter used in the Phenomizer), and 
exact computation using the algorithm in this work. In 



PARAMETER: NONE 



PARAMETER: NOISE 



i I i 



14,99 



3.39 



3 . 77 



3.21 



\ 



\ 



\ 



°J 



PARAMETER: IMPRECISION 



PARAMETER: IMPRECISION AND NOISE 



4.53 



4,1 



3.63 



3.11 




a a 



\ 



\ 



o- 



\ 



\ 



Figure 4 Impact of Exact P-value Computation for Clinical Diagnostics with the HPO Simulations for Clinical Diagnostics using the HPO. 
Patient (phenotype) data was simulated and queried against the complete database of all 4992 annotated diseases. The best result is obtained if 
the original disease is assigned the rank one (y-axis) by the search algorithm. Different approaches are compared (x-axis). Data were generated 
without error NONE and with NOISE (top row, left and right) and with IMPRECISION and both IMPRECISION and NOISE (bottom row, left and 
right) as explained in the Methods section. The mean rank is shown below each boxplot. 



Schulz et al. BMC Bioinformatics 201 1, 12:441 
http://www.biomedcentral.eom/1 471 -2 1 05/1 2/441 



Page 9 of 12 



all cases, using the exact P-value computation signifi- 
cantly outperforms the four alternative ranking methods 
(Mann- Whitney P-value < 0.001) and ranks the true dis- 
ease on rank one most of the time. The improvement 
for the exact score distribution computation is due to 
the fine-grained resolution especially for small P-values, 
where sampling is often underrepresented, but which 
are important for selecting the best rank (see Additional 
File 1). 

We then investigated the runtime for different q 
values as compared to using the naive algorithm and 
Monte Carlo sampling (Table 1). For that purpose we 
selected four diseases with a different number of anno- 
tated HPO terms, and therefore different size of y~iMPL t 
and show the runtime of the three approaches in milli- 
seconds. The naive algorithm cannot be utilized for q > 
2. The exact P-value computation is faster than random 
sampling with 10 repetitions for q = 2,3 and for the 
disease with only 17 terms in qr IMPL independent of the 
analyzed q. Starting from q = 4 the sampling based 
approach is faster for large |7~ JMPL | because of the huge 



Table 1 Runtime in milliseconds averaged over 20 runs 
comparing the naive, exact, and sampled distribution 
computation for q = 2,3,4, and 5 



Runtime Analysis with the HPO 










runtime in milliseconds 


OMIM ID 


iri 


j-y-JMPLi 


\u\ 


naive 


exact 


sampled* 


q = 2 














264300 


5 


17 


16 


3779 


4 


50 


613124 


7 


36 


36 


3794 


6 


53 


1 1 3450 


12 


80 


72 


3789 


6 


65 


129500 


20 


66 


61 


3702 


15 


89 


q = 3 














264300 


5 


17 


16 


~ 1.2 ■ 


10 7 4 


49 


613124 


7 


36 


36 


-1.2- 


10 7 6 


53 


1 1 3450 


12 


80 


72 


~ 1.2 ■ 


10 7 19 


66 


1 29500 


20 


66 


61 


~ 1.2 ■ 


10 7 15 


79 


q = 4 














264300 


5 


17 


16 




5 


--16 


613124 


7 


36 


36 




20 


55 


1 1 3450 


12 


80 


72 




250 


65 


1 29500 


20 


66 


61 




135 


77 


q = 5 














264300 


5 


17 


16 




7 


--18 


613124 


7 


36 


36 




141 


5--1 


1 1 3450 


12 


80 


72 




3896 


63 


1 29500 


20 


66 


61 




1776 


79 



Four OMIM diseases with a varying number of annotated HPO terms (|7~P 
were used; 264300: 17-/J Hydroxysteroid Dehydrogenase III deficiency, 613124: 
Hydrops fetalis, nonimmune, with gracile bones and dysmorphic features, 
113450: Brachydactyly-distal symphalangism syndrome, 129500: Ectodermal 
dysplasia 2, hidrotic. Entries denoted "-" were terminated after four hours. 
*Sampling with 10 s repetitions. 



size of the score distribution to be computed, but even 
for q = 5 the complete score distribution can be com- 
puted in under 4 seconds for diseases with many anno- 
tations. Note again that the average size of fiMPL [ s 3°, 
in the HPO. 

Discussion 

In this work we have tackled the unstudied problem of 
computing the score distribution for similarity searches 
with ontologies. We have devised an efficient preproces- 
sing of the underlying DAG of the ontology that reduces 
the complexity for similarity measures based on Resnik's 
popular definition of similarity [19]. We have introduced 
a new algorithm based on multiset enumeration, which 
can be applied to score distribution computation for Eq. 

(3) as well as variants based on maximum similarity Eq. 

(4) . In experiments with the HPO, as well as in theory, 
we show that the new algorithm is much faster than 
exhaustive enumeration of the score distribution or 
resampling approaches and that it is applicable to cur- 
rent ontologies. 

The algorithm we describe here can be used as a com- 
ponent of a procedure to find the best hit in a database, 
i.e., we need to calculate the score for each entry in the 
database and rank the results according to P-value. This 
allows users to enter a list of characteristics or features 
in order to identify objects whose characteristics best 
match the query using semantic similarity. We have 
implemented our algorithm in the setting of medical 
diagnostics, where the features are the signs and symp- 
toms of diseases and the domain objects are diseases. 
We have previously shown that this kind of search is 
useful for medical differential diagnosis [17]. 

Summarizing all nodes that have the same target set 
similarity score makes use of the fact that the pairwise 
similarity defined by Resnik only considers the common 
ancestors of the relevant terms (Lemma 1). Extending the 
proposed algorithm for other popular semantic similarity 
measures based on the information content of a node, 
like Jiang and Conrath or Lin [20,21], or the symmetric 
definition of Eq. (3) [12], has not been considered here as 
definition of pairwise similarity additionally incorporates 
the information content of the nodes in the query. There- 
fore, additional steps are necessary which render the 
computations more complicated. Although this can be 
considered a limitation of the current approach, we 
believe the methodology introduced here will prove use- 
ful for other measures as well. For example the term 
overlap similarity measure [22], comparably, only consid- 
ers common ancestors of query and target set terms, thus 
an algorithm with similar complexity appears possible 
from the results presented in this paper. One of the rea- 
sons why the P-value based rankings outperform the 
rankings based on scores is that the former account for 
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the annotation bias as observed by Wang et al. [23]. The 
best-match average semantic similarity measures based 
on Resnik, like Eq. (3), were shown to have a strong bias. 
The annotation bias is a further argument to use P-values 
instead of the similarity scores alone. 

In the mentioned study by Wang et al. [23], the authors 
consider the comparison of two proteins via their anno- 
tated GO terms, instead of considering any possible sub- 
set of the ontology terms as query as in our search setup. 
Their approach is to compensate for the annotation bias 
by simulating the distribution of pairwise similarity 
scores for all annotated ontology term sets and normaliz- 
ing using a power transformation. Similarly to our 
experiments, their method might improve when the 
exact score distribution is computed using our algorithm. 

In a practical implementation of our algorithm, the 
P-values could be precomputed for each entry in the data- 
base (such as all the diseases in OMIM or each protein in 
the human proteome). For small q, the P- values could be 
calculated dynamically. This might be useful if users are 
allowed to filter out portions of the database from the 
search based on some predefined groups (for instance, in 
genetics, the differential diagnosis might be restricted to 
diseases showing a certain mode of inheritance). 

Due to its simple structure the new algorithm could 
be parallelized to run with several threads with close to 
linear speedup, by keeping the scores in different hash 
structures for each thread and merging all hashes at the 
end to get the complete distribution. Also, as often only 
the P-value is of interest, a branch and bound formula- 
tion of the new algorithm might lead to a significant 
speedup in practice. 

Conclusions 

The algorithmic improvement reported here might 
prove useful for P-value computation of other semantic 
similarity measures that are based on the information 
content of a node as introduced by Resnik [12]. How- 
ever, when the similarity score includes more dependen- 
cies the size of the complete score distribution may 
increase significantly. Further algorithmic development 
will be necessary to increase the class of similarity mea- 
sures for which P-values can be computed efficiently. 

We believe that our methods would be applicable to 
other applications in which users search for domain 
objects that best exemplify a set of desired attributes and 
that they can be used to improve bioinformatic methods 
that use the semantic similarity scores alone. For that pur- 
pose we implemented a software in Java that computes 
exact score distributions for both similarity measures dis- 
cussed here. The software works with any ontology avail- 
able in OBO format and is available for non-commercial 
and academic usage under: https://compbio.charite.de/ 
svn/hpo/ trunk/ sre/ tools/ significance/ 



Appendix A 

In this Appendix, we will prove Theorem 1 for arbitrary q. 
In the following text, we will outline the approach of the 
proof and introduce a few new definitions. We can calcu- 
late the P-values, Eq. (6), by computing the frequency F t of 
each score 5, in the score distribution, i.e., by calculating 
the number of queries that result in score 5, for each pos- 
sible score. We will consider all query sets Q that result in 
score S, denoted as Qs later in Eq. (15). These initial query 
sets consist of the nodes from the Ontology DAG 
Q = {V, E). Subsequently, we will substitute sets of nodes 
Q by multisets M^Q) over their target set similarity 
scores in Eq. (16). This is the important switch that estab- 
lishes the independence of the number of nodes in the 
graph by only considering their target set similarity scores. 
At this step, changing from sets to multisets is necessary, 
because the same target set similarity score may occur 
more than once given nodes in a single Q. However, the 
induced multisets from all sets in Qs are themselves not 
unique and therefore we will use the multiset frequency, 
Eq. (12), over the set of unique multisets M q s given Qs to 
compute the desired quantity F in the proof. 

We are interested in the set Qs of all sets n q } of 

nodes n q } £ V, which result in the same average 

score S. That is, Qs is the set of all queries of size q that 
result in the same average score S: 

Qs = {{ni, . . . , ...,«,} c V, iim"*({m, . . . , n,}, T) = S). (15) 

The core message of Theorem 1 is that we can define 
a multiset M q over the target set similarity scores s 
whose frequency can be used to compute the frequency 
F of scores S in the score distribution. A necessary first 
step therefore is to express a query set 
Q = {ni, . . . , n q \ c V as a multiset A-f?(Q): 

M*{Q) = {(si,mi), . . . , [so, m 0 )\ Si e U Q , m l = mgU16) 

where 

Uq = {Si\tii e Q,sim{rii,T) = s,} (17) 
and 

= \{m\rn e Q,sim{n u T) = s,}|. (18) 

The underlying set UQ for a multiset M1(Q) consists 
of all existing distinct target set similarity scores s,- of 
the nodes in Q, Eq. (17), and their multiplicity is the 
number of nodes in Q that share the same score s„ Eq. 
(18). 

Now that we know how to create a multiset of target 
set similarity scores from any given set of nodes in V, 
we need another variable M q s to represent all distinct 
multisets that can be generated using Eq. (16) from the 
set Qs- The set of distinct multisets Mj generated for a 
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given Qs is defined as: 
Ml = {Mt{Q)\Qz Q s }. 



(19) 



We can now state the proof of Theorem 1 as follows. 
Proof. 



F=\Q S \ 

- e n 

= J2 /^(M") 



IN, | 
m 



MleM'' 



E 



frecl{Ml) 



M«eM;;„, sim""S(M'l)=S 



(20) 



(21) 



(22) 



(23) 



Eq. (20) merely restates the definition of the Fre- 
quency F given by Eq. (15), namely the number of all 
queries Q C V that result in sim avg = S. Note that Eq. 
(15) is representing the number of such queries in terms 
of sets of nodes of the ontology. Eq. (21) switches the 
representation from nodes in V to multisets Mj over the 
similarity scores of nodes in V using Eq. (19) and the 
definition of multiset frequency given in Eq. (12). Eq. 

(22) follows directly from the definition of the multiset 
frequency in Eq. (12). The equality between Eq. (22) and 

(23) is a direct consequence of Eq. (15) and (19). 

Additional material 



Additional file 1: Additional File 1 contains some additional plots 
showing the differences in ranking by exact and sampled P-values 
for Clinical Diagnostics with the HPO 
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